In this file, we import the already cleaned data (in which missing values have been imputed) containing all individual input variables. Then, we import the parameters of the best-performing machine learning model, which we use to predict future happiness. Before generating the predictions, we apply a series of country-level regression models that project the historical values onto the next two years, thus providing the inputs we need for applying our machine learning model.
Importing relevant packages, defining custom functions, specifying local folders etc.
# Importing relevant packages
# For general data-related tasks
library(plyr)
library(tidyverse)
library(data.table)
library(openxlsx)
library(readxl)
library(arrow)
library(zoo)
# For statistical analysis and ML
library(modelr)
library(randomForest)
# For data visualization
library(plotly)
library(ggplot2)
library(gridExtra)
Below, we import historical data on happiness and various background
variables, where imputations for missing data have already been
performed. In addition, we import the parameters we need for fitting the
best-performing machine learning model (as evidenced by our tests in the
ML_modelling.Rmd notebook).
Here, we specify how many years into the future to generate predictions for. It should be noted that the farther into the future we go, the less certain the predictions become.
## [1] "Note: predictions will be generated for the years 2023-2024."
We also specify whether to test all possible models for the data used as input to our machine learning model or whether to import the results of a previous test.
## [1] "Results from previously model tests will be imported."
## [1] "Note: this assumes that we're using more or less the same input data."
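A minimal sketch of how these two settings might look in code. `RefreshAllInputs` is the name used in this notebook; `LastObservedYear`, `YearsAhead`, `CachePath` and `run_model_tests()` are hypothetical names introduced for illustration, and the last observed year of 2022 is an assumption inferred from the 2023-2024 prediction window.

```r
# Forecast horizon (2022 as the last observed year is an assumption)
LastObservedYear <- 2022
YearsAhead <- 2
PredictionYears <- seq(LastObservedYear + 1, LastObservedYear + YearsAhead)
print(paste0("Note: predictions will be generated for the years ",
             min(PredictionYears), "-", max(PredictionYears), "."))

# Switch between re-running the model tests and re-using cached results
RefreshAllInputs <- FALSE
CachePath <- file.path(tempdir(), "model_test_results.rds")  # hypothetical location

# Hypothetical stand-in for the expensive model-testing step
run_model_tests <- function() {
  data.frame(Country = "A", Variable = "x", BestModel = "linear",
             stringsAsFactors = FALSE)
}

if (RefreshAllInputs || !file.exists(CachePath)) {
  # Recompute and cache the results for later runs
  ModelTests <- run_model_tests()
  saveRDS(ModelTests, CachePath)
} else {
  # Re-use the results of a previous run
  ModelTests <- readRDS(CachePath)
}
```

Falling back to a fresh computation when no cache file exists keeps the first run from failing even if the flag is left at `FALSE`.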
Before we can proceed to fitting our random forest model, we need to make sure that we have input data for future time periods. Unfortunately, such data is not directly available. However, reasonable approximations can be obtained by projecting the trends found in the historical data into future time periods.
The decision of exactly which variables to include in our predictive
model(s) is based on our findings from the
Ridge_regression_analysis.Rmd notebook, where we explored
different combinations of variables.
A full list of all input variables fed into the models tested below is presented here:
## [1] "E_GDPPerCapitaConstant" "E_PovertyGap685Headcount"
## [3] "E_HealthExpenditurePerCapita" "E_ExportsPctOfGDP"
## [5] "E_GiniIndex" "E_EducationExpenditurePctOfGDP"
## [7] "E_LaborTaxPctOfProfits" "E_ConsumerPriceInflation"
## [9] "E_FemaleUnemployment" "E_ImportsPctOfGDP"
## [11] "P_GovernmentEffectiveness" "P_ControlOfCorruption"
## [13] "P_RuleOfLaw" "P_CleanElectionsIndex"
## [15] "S_AccessToCleanFuelsPctOfTotal" "S_UrbanPopPctOfTotal"
## [17] "S_AccessToElectricityPctOfTotal" "S_UpperSecEduPctOfTotal"
## [19] "S_CompulsoryEducationYears" "S_LaborParticipRateFemale"
## [21] "S_PopulationAged14OrLess" "V_CO2PerCapita"
## [23] "V_FertilizerUseKgPerHectare" "V_AgriculturalLandPctOfTotal"
## [25] "V_ForestAreaPctOfTotal" "H_AirPollutionMeanExpPctOfPop"
## [27] "H_TotalSuicideRate"
We project historical trends into the future as follows. We loop through all countries in the dataset and all variables used as inputs in the machine learning model. For each combination, we fit the following model types and record their RMSE scores:
Linear model where the input variable is modeled as a function of time
Quadratic model where the input variable is modeled as a function of time
Cubic model where the input variable is modeled as a function of time
Autoregressive models where the input variable is modeled as a function of its previous value(s), including anywhere from 1 to 3 annual lags
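The model types listed above can be sketched for a single country and variable as follows. This is an illustration under stated assumptions, not the notebook's actual code: the series values are made up, the helper names (`fit_trend`, `fit_ar`) are hypothetical, and RMSE is computed in-sample because the text does not say whether a hold-out period is used.

```r
# Toy annual series for one country and one variable (made-up values)
series <- data.frame(
  Year  = 2005:2022,
  Value = c(10.1, 10.4, 10.9, 11.0, 11.6, 12.1, 12.3, 12.9, 13.2,
            13.8, 14.1, 14.9, 15.2, 15.6, 16.3, 16.8, 17.1, 17.9)
)

# Polynomial trend in time: degree 1 = linear, 2 = quadratic, 3 = cubic
fit_trend <- function(df, degree) {
  lm(Value ~ poly(Year, degree), data = df)
}

# Autoregressive model with p annual lags, fitted by OLS on lagged columns.
# embed() puts the current value in column 1 and lags 1..p in the next columns.
fit_ar <- function(df, p) {
  lagged <- as.data.frame(embed(df$Value, p + 1))
  names(lagged) <- c("Value", paste0("Lag", seq_len(p)))
  lm(Value ~ ., data = lagged)
}

models <- list(
  linear    = fit_trend(series, 1),
  quadratic = fit_trend(series, 2),
  cubic     = fit_trend(series, 3),
  ar1       = fit_ar(series, 1),
  ar2       = fit_ar(series, 2),
  ar3       = fit_ar(series, 3)
)

# In-sample RMSE per model (assumption); AR(p) drops the first p observations
scores <- sapply(models, function(m) sqrt(mean(residuals(m)^2)))
print(round(scores, 3))
```

Because the autoregressive models are fitted on fewer observations than the trend models, their in-sample RMSE is not strictly comparable; a hold-out comparison would be a stricter alternative.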
Please note that this part may be computationally intensive and may
take 5-10 minutes to complete. To avoid re-testing all possible
models, the user can instead import the results from a previous
test by adjusting the value of the RefreshAllInputs variable.
## [1] "Note: imported 4023 best performing model results from previous run."
Here, we use the model test results generated/imported in the preceding section to identify and fit the best performing model for each country and variable. Following this, we generate predictions for the period 2023-2024. Finally, we consolidate the predictions of the background data into a single data frame that can be passed on to our machine learning model.
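As a rough illustration of this step, the sketch below picks the lowest-RMSE model per country and variable and then forecasts two years ahead. The results table, the toy series and the helper names (`forecast_trend`, `forecast_ar1`, `forecast_best`) are hypothetical, and only three model types are covered to keep the example short.

```r
# Hypothetical test results: one row per country, variable and model type
results <- data.frame(
  Country  = c("A", "A", "A", "B", "B", "B"),
  Variable = "E_GiniIndex",
  Model    = rep(c("linear", "quadratic", "ar1"), 2),
  RMSE     = c(0.42, 0.31, 0.35, 0.20, 0.25, 0.18),
  stringsAsFactors = FALSE
)

# Keep the lowest-RMSE row within each country/variable group
ordered <- results[order(results$Country, results$Variable, results$RMSE), ]
best <- ordered[!duplicated(ordered[c("Country", "Variable")]), ]

# Toy history for country A (made-up values)
history <- data.frame(
  Year  = 2013:2022,
  Value = c(30.1, 30.4, 30.2, 30.9, 31.3, 31.1, 31.8, 32.0, 32.4, 32.9)
)

# Trend models: refit and predict directly at the future years
forecast_trend <- function(df, degree, years) {
  fit <- lm(Value ~ poly(Year, degree), data = df)
  unname(predict(fit, newdata = data.frame(Year = years)))
}

# AR(1): forecast one step at a time, feeding each prediction back in
forecast_ar1 <- function(df, years) {
  n <- nrow(df)
  lagged <- data.frame(Value = df$Value[-1], Lag1 = df$Value[-n])
  fit <- lm(Value ~ Lag1, data = lagged)
  out <- numeric(length(years))
  last <- df$Value[n]
  for (i in seq_along(years)) {
    last <- unname(predict(fit, newdata = data.frame(Lag1 = last)))
    out[i] <- last
  }
  out
}

forecast_best <- function(df, model, years) {
  switch(model,
         linear    = forecast_trend(df, 1, years),
         quadratic = forecast_trend(df, 2, years),
         ar1       = forecast_ar1(df, years),
         stop("Unknown model type: ", model))
}

best_for_A <- best$Model[best$Country == "A"]
predictions <- data.frame(
  Country = "A",
  Year    = 2023:2024,
  Value   = forecast_best(history, best_for_A, 2023:2024)
)
print(predictions)
```

Repeating the last step for every country and variable and binding the rows together would give the single data frame described above.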
To avoid repeating this process on each re-run of the notebook,
the user can instead import the placeholder data created in a
previous run by adjusting the value of the RefreshAllInputs
variable.
## [1] "Successfully imported input data for ML model from a previous run."
Now that we have all inputs available for our ML model, we can proceed by fitting it on the training data and then using it to generate predictions for future time periods.
Finally, we import our pre-optimized random forest model and use it to generate predictions for the period 2023-2024.
## [1] "Successfully generated predictions for 298 rows of data."
## [1] "Clean dataset containing history and predictions exported to 'Data/Output'."
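The fit-then-predict pattern described above might look roughly like this with the `randomForest` package. The training data is simulated, only two of the input variables are used, and the `ntree` and `mtry` values are placeholders rather than the tuned parameters imported by the notebook.

```r
library(randomForest)

set.seed(42)  # random forests are stochastic; fix the seed for reproducibility

# Simulated training data: two predictors and a happiness-like score
train <- data.frame(
  E_GDPPerCapitaConstant = runif(200, 500, 60000),
  P_RuleOfLaw            = rnorm(200)
)
train$Happiness <- 3 + 0.00004 * train$E_GDPPerCapitaConstant +
  0.5 * train$P_RuleOfLaw + rnorm(200, sd = 0.3)

# ntree and mtry are placeholder values, not the notebook's tuned parameters
rf <- randomForest(Happiness ~ ., data = train, ntree = 300, mtry = 1)

# Projected inputs for future periods (made-up values)
future <- data.frame(
  E_GDPPerCapitaConstant = c(1200, 25000, 55000),
  P_RuleOfLaw            = c(-1.0, 0.2, 1.5)
)
future$PredictedHappiness <- predict(rf, newdata = future)
print(future)
```

The column names in the future data must match those used in training, which is why the projected inputs are consolidated into one data frame before this step.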
First, we would like to see what the happiest countries are based on their last predicted annual scores:
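A minimal sketch of how such a ranking could be produced, assuming a data frame of predictions with hypothetical columns `Country`, `Year` and `PredictedHappiness`; the countries and scores are placeholders.

```r
# Placeholder predictions for five countries and two years
predictions <- data.frame(
  Country = rep(paste("Country", c("A", "B", "C", "D", "E")), times = 2),
  Year    = rep(c(2023, 2024), each = 5),
  PredictedHappiness = c(7.1, 5.4, 3.2, 6.6, 4.8,
                         7.2, 5.3, 3.1, 6.7, 4.9),
  stringsAsFactors = FALSE
)

# Keep only the last predicted year
latest <- predictions[predictions$Year == max(predictions$Year), ]

# Happiest countries: sort by score, highest first
happiest <- head(latest[order(-latest$PredictedHappiness), ], 3)

# Least happy countries: sort by score, lowest first
least_happy <- head(latest[order(latest$PredictedHappiness), ], 3)

print(happiest)
print(least_happy)
```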
As was the case in historical data, we see a lot of Nordic and European countries among the happiest countries in the world; in fact, among the top 10 happiest in 2024, we only see European countries.
Looking at the least happy countries, the predicted happiness scores are not surprising either given the historical background:
Similar to what we observed in the historical data, many countries suffering from internal or armed conflicts, such as Afghanistan, Lebanon and Yemen, are expected to retain their relatively low levels of happiness in 2024.
To be better able to compare happiness across the globe, we create an interactive color-coded world map where each country will get its own happiness score plotted with a different color shade. Unfortunately, we do not have data for all countries, so some states will be colored white due to missing observations.
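One way to build such a map is with `plot_geo()` from the `plotly` package, sketched below. It assumes each country is identified by an ISO 3166-1 alpha-3 code in a hypothetical `Code` column; the scores are illustrative values only, not the notebook's predictions.

```r
library(plotly)

# Illustrative values only; Code holds ISO 3166-1 alpha-3 country codes
map_data <- data.frame(
  Code    = c("DNK", "FIN", "AFG"),
  Country = c("Denmark", "Finland", "Afghanistan"),
  PredictedHappiness = c(7.5, 7.7, 2.5),
  stringsAsFactors = FALSE
)

# Choropleth trace: one colour shade per country, hover text shows the name
fig <- plot_geo(map_data) %>%
  add_trace(
    z         = ~PredictedHappiness,
    color     = ~PredictedHappiness,
    colors    = "Blues",
    text      = ~Country,
    locations = ~Code
  ) %>%
  colorbar(title = "Predicted happiness") %>%
  layout(title = "Predicted happiness scores, 2024")

fig
```

Countries that are absent from the data frame receive no colour, which corresponds to the uncoloured states mentioned above.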
In this notebook, we’ve demonstrated how we can use a predictive ML model to estimate the expected level of happiness in the future. In order to do so, we first needed to generate input data for future time periods that we can feed into the model. This was done at the country level, where trends for individual variables were projected into the next 2 years using the most appropriate regression model for the individual series. After we had our input data ready, the process of predicting future happiness scores was quite straightforward. Finally, we created some visualizations to help us make sense of the predictions, which turned out quite sensible given the historical context.